Introduction and settings

In this tutorial, we will be analyzing sample N1 from a dataset of tumor and immune cell populations from human early-stage lung adenocarcinomas (He et al., Oncogene 2021). Data has been obtained using 10X Genomics technology. The sample N1 is composed of 14,972 single cells that were sequenced on the Illumina Hiseq X platform. Raw data can be download from here. Prior to the analysis with popsicleR, raw data has been processed with Cell Ranger software to align reads and generate feature-barcode matrices. Cell Ranger data, comprising the gzipped TSV files of the feature-barcode matrix and of feature and barcode sequences, can be found here.
We start assigning a name to the sample (i.e., N1) and defining the path to the folder containing raw input data. popsicleR accepts as input either files from the Cell Ranger pipeline of 10X Genomics or a feature-barcode matrix of raw counts generated from any microfluidic-, microwell plates-, or droplet-based scRNA-seq technology. To organize input, it is convenient to create a directory named as the sample name and containing a sub-folder with the input data (either in the form of Cell Ranger gzipped TSV files or of a tab-delimited matrix). In this example, the Cell Ranger input files are stored in the cellranger_data folder. Some steps of the analysis that involve Seurat functions can be performed using parallelization following the instructions reported here. Throughout the various analysis steps, all parameters values are saved into a tab-delimited file named popsicleR.log.

library(popsicleR)
## Hello and welcome to popsicleR, an interactive Pipeline Of Preprocessing for SIngle CelL Experiment data.
## In this workflow, messages are colour coded: 
## green messages will provide information on the ongoing step 
## cyan messages will require the user to provide an input for advancing the script 
## red messages will report missing information or wrong inputs 
## grey messages report functions and software system outputs.
# set the working directory and sample name
sample.name <- "N1"
main.dir <- file.path("~","PoPsicleExample",sample.name)
input.data.dir <- file.path(main.dir,"cellranger_data")
setwd(main.dir)


Quality controls

We start the analysis using the PrePlots function. PrePlots first reads the output of the cellranger pipeline from 10X through the Read10X() function of the Seurat package and returns a unique molecular identified (UMI) count matrix. The values in the count matrix represent the number of molecules for each feature (i.e., gene; row) that are detected in each cell (column). During the generation of the count matrix, a soft filter is applied to remove low quality cells. By default, PrePlots removes genes that are expressed in less than 0.1% of cells (percentage = 0.1) and cells that express less than 200 genes (gene_filter = 200). Additional parameters specify the organism (organism = "human") and if data has been generated using cellranger (cellranger = TRUE). If cellranger is set to FALSE, input_data must be the name of a feature-barcode matrix file. PrePlots returns violin (Figure 1), density (Figure 2), and scatter (Figure 3) plots to easily explore QC metrics and set user-defined filtering thresholds. These plots report commonly used QC metrics as the number of unique genes detected in each cell (nFeature_RNA), the total number of molecules detected within a cell (nCount_RNA), the percentage of reads that map to the mitochondrial genome (percent_mt), and to ribosomal (percent_ribo) and dissociation genes (percent_disso). To help the user avoiding the removal of cells intrinsically characterized by outlying levels of unique genes, total UMIs, and mitochondrial fractions, PrePlots also returns the cell abundance and the distributions of these features within specific cell populations identified by a list of user-defined marker genes (e.g., SFTPC for alveolar cells and CD3D for T cells; Figure 4-5). All plots are saved in the 01.QC_Plots folder within the working directory.

# Define a list of cell-type markers
populations.markers <-  c("EPCAM","VIM","COL1A1","PECAM1","PTPRC","CD3D","CD14","SFTPC")

# Create a Seurat object from the raw data and visualize QC metrics
sample.umi <- PrePlots(sample = sample.name, input_data = input.data.dir, genelist = populations.markers, percentage = 0.1, gene_filter = 200, cellranger = TRUE, organism = "human", out_folder = main.dir)
## 
## Plotting QC Violin plots
## Plots saved in:  01.QC_Plots\01a_QC_violin_plots.pdf 
## Plotting QC Density plots
## Plots saved in:  01.QC_Plots\01b_QC_Hist_nGene_nUMI_MTf_Ribo.pdf 
## Plotting QC Scatter plots
## Plots saved in:  01.QC_Plots\01c_QC_Scatter_nGene_nUMI_MTf.pdf 
## Plotting QC per gene Histograms
## Plots saved in:  01.QC_Plots\01d_QC_Hist_Check.pdf 
## Plotting QC per gene Scatter plots
## Plots saved in:  01.QC_Plots\01e_QC_Scatter_Check.pdf 
## 
## Now check the graphs, choose your thresholds and then run FilterPlots

<b>Figure 1</b>. Violin plots.

Figure 1. Violin plots.


<b>Figure 2</b>. Density plots.

Figure 2. Density plots.


<b>Figure 3</b>. Scatter plots.

Figure 3. Scatter plots.


<b>Figure 4</b>. Density plots for cells expressing CD3D, a marker of T lymphocytes.

Figure 4. Density plots for cells expressing CD3D, a marker of T lymphocytes.


<b>Figure 5</b>. Scatter plots for cells expressing CD3D, a marker of T lymphocytes.

Figure 5. Scatter plots for cells expressing CD3D, a marker of T lymphocytes.


Filtering

Appropriate custom filters to discard low-quality cells can be applied using the FilterPlots function. Cells can be filtered based on the minimum and maximum number of genes detected in each cell (G_RNA_low and G_RNA_hi), the minimum and maximum number of molecules detected within a cell (U_RNA_low and U_RNA_hi), and the maximum percentage of mitochondrial, ribosomal, and dissociation genes (percent_mt_hi, percent_ribo_hi and percent_disso_hi, respectively). Once thresholds are applied, FilterPlots returns summary graphs with data distributions and correlations that highlight with colors cells passing the filtering procedure (Figure 6). Filtering thresholds are displayed as gray lines. All plots are saved in the 01.QC_Plots folder.

# Filter the Seurat object based on QC metrics
sample.umi <- FilterPlots(UMI = sample.umi, G_RNA_low = 500, G_RNA_hi= Inf, U_RNA_low = -Inf, U_RNA_hi = 37000, percent_mt_hi = 10, percent_ribo_hi = 100, percent_disso_hi = 100, out_folder = main.dir)
## The selected thresholds will filter 3640 cells
## 
## Plotting QC final plots
## 
## Plots saved in:  01.QC_Plots\01f_final_Hist_plots.pdf 
## Plotting QC final scatter plots
## Plots saved in:  01.QC_Plots\01g_final_Scatter_plots.pdf 
## 
## Next suggested step is Doublets Calculation, run CalculateDoublets 
## 
## WARNING: 
## It is recommended to first run CalculateDoublets step setting dbs_thr ='none' or dbs_rate =NULL and dbs_remove= FALSE. 
## Once checked the graphs it is possible to re-run this step specifying a custom threshold 
## through the 'dbs_thr' or 'dbs_rate' parameter and removing all the cells identified as doublets setting 'dbs_remove' parameter as TRUE.

<b>Figure 6</b>. Post filtering scatter plots. Cells passing the filtering procedure are highlighted with colors.

Figure 6. Post filtering scatter plots. Cells passing the filtering procedure are highlighted with colors.


Doublet detection

As an additional QC and filtering step, the function CalculateDoublets allows detecting and, eventually, removing cell doublets, i.e., groups of cells captured and associated to the same unique barcode. Cell doublets can be identified using the R version of the python module Scrublet (Wolock et al., Cell Syst 2019),implemented based on the code provided by Chengxian Qiu at (https://rdrr.io/github/ChengxiangQiu/rscrublet/), or scDblFinder (Germain et al., F1000Research 2021). Both methods use k-nearest neighbor (KNN) classifiers to distinguish doublets from singlets. When using Scrublet (method = "scrublet"), the function returns histograms of observed and simulated doublet score distributions, the value of the automatically detected threshold for the simulated doublet score (Figure 7). scDblFinder (method = "scDblFinder") is used with default parameters, but the user can change the value of the expected doublet rate (dbs_rate) that, for 10x data, is assumed to be 1% per thousand cells captured. For both methods,CalculateDoubletsvisualizes detected cell doublets and doublet scores on a UMAP projection (Figure 8), adds doublet classifications and scores in the metadata of the Seurat object, and eventually removes cells classified as doublets setting the dbs_remove argument to TRUE. All plots are saved in the 01.QC_Plots folder. In this tutorial, we identified doublets using method = "scrublet" (default) to illustrate the visualization of the observed and simulated doublet score distributions. However, it is worth noting that when used with Scrublet, CalculateDoublets can take several minute to complete, while the function runs faster with scDblFinder.

# Doublet detection
sample.umi <- CalculateDoublets(UMI = sample.umi, method = "scrublet", dbs_thr ='none', dbs_remove = FALSE, out_folder = main.dir)
## 
## Plotting histogram of doublet scores
## Plots saved in:  01.QC_Plots\01i_histogram of doublet scores.pdf 
## Plotting doublets UMAP
## Plots saved in:  01.QC_Plots\01h_doublets_umap.pdf 
## Once checked the graphs it is possible to re-run this step specifying a custom threshold through the 'dbs_thr' parameter and removing all the cells identified as doublets setting 'dbs_remove' parameter as TRUE.

<b>Figure 7</b>. Histograms of observed and simulated doublet score distributions. The solid line indicates the automatically detected threshold for the simulated doublet score.

Figure 7. Histograms of observed and simulated doublet score distributions. The solid line indicates the automatically detected threshold for the simulated doublet score.


<b>Figure 8</b>. Visualization of the cell doublets and of the doublet scores on a UMAP projection.

Figure 8. Visualization of the cell doublets and of the doublet scores on a UMAP projection.


When using Scrublet (method = scrublet), the threshold for the simulated doublet score can be manually modified setting the dbs_thr based on visual inspection of the observed and simulated doublet score distributions and of the UMAP projection. In this example, we changed the automatically detected threshold setting dbs_thr = 0.22 and remove doublets.

# Doublet removal
sample.umi <- CalculateDoublets(UMI = sample.umi, method = "scrublet", dbs_thr = 0.22, dbs_remove = TRUE, out_folder = main.dir)
## Removing 362 Doublets 
## Next suggested step is data normalization, run Normalize


Data normalization

After removing unwanted cells, data are normalized using the Normalize function. By default, Normalize implements the global-scaling normalization method “LogNormalize” of the Seurat NormalizeData function. LogNormalize normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. The Normalize function also calculates a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., that are highly expressed in some cells, and lowly expressed in others) and that will be used in downstream analysis, like PCA dimensional reduction. This step is based on the FindVariableFeatures function of the Seurat package (with default of 2,000 features, i.e., variable_genes = 2,000). Signal distributions before and after normalization and plots highlighting the highly variable genes are saved in the 02.PreProcessing folder.

# Normalization
sample.umi <- Normalize(UMI = sample.umi, variable_genes = 2000, out_folder = main.dir)
## 
## Plotting Normalization graphs
## Plots saved in:  02.PreProcessing\02a_total_expression_after_before_norm.pdf 
## Plotting High Variables Genes
## Plots saved in:  02.PreProcessing\02b_plot_FindVariableGenes.pdf 
## Next suggested step is regression, run  ApplyRegression 
## It is suggested to apply regression without providing any variables to regress and without exploring PCs (explore_PC=FALSE) to save time and computational effort.
## Regression variables can be chosen through visual inspection of this function outputs 
## Once selected, set explore_PC as TRUE and explore graphs to identify the PCs number to use in your analysis


Scaling and regression

Before dimensionality reduction, scaling and, eventually, regression must be performed. Scaling is required for PCA and makes the data to have zero-mean and unit standard deviation (standardization). Regression is used to mitigate the effect of unwanted sources of variation, as experimental and biological biases. One source of biological variation is represented by cell cycle that is quantified before scaling or regression. This quantification exploits the method implemented in the Seurat CellCycleScoring function to assign each cell with cell cycle scores and a predicted cell cycle phase, based on the expression of G2/M and S phase markers. ApplyRegression simply scales the data, if no variable to regress is indicated (default, variables = "none"), or performs regression on several variables specified by the user. The variables argument can be a combination of, e.g., the number of genes (nFeature_RNA), UMIs (nCount_RNA), percentage of mithocondrial genes (percent_mt), and the cell cycle scores (S.Score and G2M.Score). The function uses the functionality of Seurat ScaleData. ApplyRegression returns projections on reduced spaces (PCA, t-SNE, and UMAP using 20 principal components) to inspect the effect of variable regression (Figure 9-10). If the explore_PC is set to TRUE, once completed scaling and, eventually, regression, ApplyRegression performs PCA on the scaled (or regressed) data and returns heatmap (Seurat DimHeatmap), jackstraw (Seurat JackStrawPlot), and elbow (Seurat ElbowPlot) plots to determine the optimal number of principal components to include in the downstream analyses. All plots are saved in the 02.PreProcessing folder and in sub-folders named after the regressed variables. In this example, we first simply scale the data (i.e., variables = "none") and then perform regression on cell cycle scores (i.e., variables = c("S.Score", "G2M.Score")).

# Regression
sample.umi <- ApplyRegression(UMI = sample.umi, organism = "human", variables = "none", explore_PC = FALSE, out_folder = main.dir)
## Calculating Cell Cycle Score 
## Plotting dimensional reduction graphs with no regression
## Plots saved in:  02.PreProcessing dedicated subfolder

<b>Figure 9</b>. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after scaling.

Figure 9. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after scaling.


# Regression
sample.umi <- ApplyRegression(UMI = sample.umi, organism = "human", variables = c("S.Score", "G2M.Score"), explore_PC = FALSE, out_folder = main.dir)
## Plotting dimensional reduction graphs after regression
## Plots saved in:  02.PreProcessing dedicated subfolder

<b>Figure 10</b>. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after regression on the cell cycle scores.

Figure 10. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after regression on the cell cycle scores.


Once inspected the post-regression PCA, t-SNE, and UMAP plots, we decide to continue with scaling only and to generate the plot to determine the optimal number of principal components (‘explore_PC = TRUE’).

# Regression
sample.umi <- ApplyRegression(UMI = sample.umi, organism = "human", variables = "none", explore_PC = TRUE, out_folder = main.dir)
## Plotting dimensional reduction graphs with no regression
## Plots saved in:  02.PreProcessing dedicated subfolder 
## Plotting graphs to explore PCs
## Plots saved in:  02.PreProcessing dedicated subfolder 
## 
## Once identified the PCs number to use in your analysis, perform clustering running CalculateCluster


Cell clustering

The CalculateCluster function exploits the graph-based clustering approach implemented through the FindNeighbors and FindClusters functions of Seurat using as input the previously identified number of principal components (dim_pca) and the resolution parameter (cluster_res) that sets the granularity of the clustering, with increasing values leading to a greater number of clusters. The resolution parameter can be either a single value or a set of values to explore cell grouping at different resolutions. Cell clusters are visualized in dimensionality reduced planes as UMAP or t-SNE embeddings (Figure 11), while a phylogenetic tree is used to show correlation between the identified clusters (Figure 12). Beside cluster membership, in UMAP and t-SNE embeddings, cells are colored according to several features that might aid in the assessment of cluster quality or in the identification of clusters of problematic cells that escaped the previous filtering steps (Figure 13). If multiple resolutions have been selected, the clustree package is used to produce a clustering tree that shows how cells are assigned to the clusters at the various clustering resolutions (Figure 14). Finally, CalculateCluster returns a dot plot for a set of user-defined genes (marker.list argument; default is a set of genes marking immune cells) in the clusters identified at the given resolutions (Figure 15). All plots are saved in the 03.Clustering folder. In this example we select dim_pca = 12 principal components, a cluster resolution varying from 0.4 to 0.8 (i.e.,cluster_res =c(0.4,0.6,0.8)) and default marker genes (marker.list = "none").

# Clustering at multiple resolutions
sample.umi <- CalculateCluster(UMI = sample.umi, dim_pca = 12, organism = "human", marker.list = "none", PCA = TRUE, cluster_res = c(0.4,0.6,0.8), out_folder=main.dir)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10881
## Number of edges: 366064
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9135
## Number of communities: 14
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10881
## Number of edges: 366064
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8915
## Number of communities: 17
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10881
## Number of edges: 366064
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8715
## Number of communities: 19
## Elapsed time: 1 seconds
## Plotting Markers graphs for each cluster
## Plots saved in:  03.Clustering folder 
## Next suggested step is annotation, run MakeAnnotation

<b>Figure 11</b>. UMAP visualization of cells colored by cluster membership at resolution 0.8. Numbers indicate the cluster identity.

Figure 11. UMAP visualization of cells colored by cluster membership at resolution 0.8. Numbers indicate the cluster identity.


<b>Figure 12</b>. Phylogenetic tree showing correlation between the identified clusters at resulution 0.8.

Figure 12. Phylogenetic tree showing correlation between the identified clusters at resulution 0.8.


<b>Figure 13</b>. UMAP visualization of cells colored by percent of ribosomal and dissociation genes and MALAT1 and GAPDH expression levels.

Figure 13. UMAP visualization of cells colored by percent of ribosomal and dissociation genes and MALAT1 and GAPDH expression levels.


<b>Figure 14</b>. Clustering tree displaying how cells move among clusters as resolution increases from 0.4 to 0.8. Cluster identity is displayed inside the circle; circle color indicates the clustering resolution; circle size is proportional to the number of cells in the cluster. Edge color is related to the number of cells whereas its transparency shows how many cells of the incoming node are assigned to the destination node.

Figure 14. Clustering tree displaying how cells move among clusters as resolution increases from 0.4 to 0.8. Cluster identity is displayed inside the circle; circle color indicates the clustering resolution; circle size is proportional to the number of cells in the cluster. Edge color is related to the number of cells whereas its transparency shows how many cells of the incoming node are assigned to the destination node.


<b>Figure 15</b>. Dot plot reporting, in the clusters identified at resolution 0.8, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.

Figure 15. Dot plot reporting, in the clusters identified at resolution 0.8, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.


In case a single resolution is indicated, CalculateCluster uses the FindAllMarkers function of Seurat to determine the differentially expressed genes in the comparison between cells in each cluster and all remaining cells (cluster markers). FindAllMarkers is applied with the minimum fraction of cells min.pct in either of the two populations set to 0.25 and limiting the test to genes showing, on average, at least logfc.threshold = 0.25 fold difference between the two groups of cells. The function returns a table, in the form of a tab-delimited file (named as 03_markers@res(cluster_res).txt), containing, for each cluster, a list of its putative markers and their associated statistics. Expression values of the top 10 markers of each cluster are used to generate a heatmap and the expression of the top 2 markers is visualized through t-SNE, UMAP, and violin plots (Figure 16). In this example we select a cluster resolution equal to 0.8 (i.e.,cluster_res =0.8) and a list of generic markers for immune cells (markers = "Immune_cells_human.txt") saved in the current directory as a tab delimited file (available here).

# Clustering at a single resolution
sample.umi <- CalculateCluster(UMI = sample.umi, dim_pca = 12, organism = "human", marker.list = "Immune_cells_human.txt", PCA = TRUE, cluster_res = 0.8, out_folder=main.dir)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10881
## Number of edges: 366064
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8715
## Number of communities: 19
## Elapsed time: 1 seconds
## Plotting Top Markers graphs for each cluster
## Plotting Markers graphs for each cluster
## Plots saved in:  03.Clustering folder 
## Next suggested step is annotation, run MakeAnnotation

<b>Figure 16</b>. Violin plot of the top 2 marker genes for cluster 7 at resolution 0.8.

Figure 16. Violin plot of the top 2 marker genes for cluster 7 at resolution 0.8.


Cell annotation

The assignment of cell type identity to clusters is achieved using the MakeAnnotation function. MakeAnnotation annotates cells using SingleR and scMCA built-in references. Specifically, SingleR is used to annotate data from human and mouse samples with the references of the celldex package. For human samples, celldex includes the Human Primary Cell Atlas (HPCA) and the Blueprint Encode (BpEn) references generated from microarrays of human primary cells and RNA-seq profiles of human stroma and immune cells, respectively. For mouse samples, references are derived from a collection of mouse bulk RNA-seq datasets (Mouse RNA-seq) and from the microarray profiles of pure mouse immune cells provided by the Immunological Genome Project (ImmGen). scMCA is used to annotate data from mouse samples with a reference constructed from the single-cell Mouse Cell Atlas (scMCA) and covering all mouse cell types. Single cell annotations for the different references are displayed color-coding cells in t-SNE and UMAP embeddings. Since annotation can be performed either at the cluster or at the single cell level, cells can be colored based on clusters (at single or multiple resolution setting the cluster_res parameter; Figure 17) or on cell type identity (Figure 18). In the annotation at single cell level, it can be difficult to discern the different populations in the low-dimensional embeddings, as cells are overlapping, and the color scales might be inefficient in separating the various cell types. To overcome this limitation, MakeAnnotation produces a plot for each identified population with only cells of the given type highlighted in color on top of the other cells colored in gray (Figure 19). Finally, MakeAnnotation returns dot plots for a set of user-defined genes in the predicted cell populations (marker.list argument as a tab-delimited file with headers saved in the working directory; default is a set of genes marking immune cells; Figure 20) and a pie-chart plot showing the clusters composition according to the various annotations (Figure 21). All plots are saved in the 04.Annotations folder.

# Cell type annotation
sample.umi <- MakeAnnotation(UMI = sample.umi, organism = "human", marker.list = "none", cluster_res= 0.8, out_folder = main.dir)
## Plotting single cell and cluster annotations
## Plotting dimensional reduction graphs for each population found in the sample
## Plots saved in:  \04.Annotation\ folder

<b>Figure 17</b>. UMAP visualization of cell clusters colored by population label assigned using the Blueprint ENCODE reference atlas.

Figure 17. UMAP visualization of cell clusters colored by population label assigned using the Blueprint ENCODE reference atlas.


<b>Figure 18</b>. UMAP visualization of single cells colored by population label assigned using the Blueprint ENCODE reference atlas.

Figure 18. UMAP visualization of single cells colored by population label assigned using the Blueprint ENCODE reference atlas.


<b>Figure 19</b>. UMAP visualization of single cells (annotated using the Blueprint ENCODE reference atlas) with CD8+ T cells highlighted in red on top of the other cells colored in gray.

Figure 19. UMAP visualization of single cells (annotated using the Blueprint ENCODE reference atlas) with CD8+ T cells highlighted in red on top of the other cells colored in gray.


<b>Figure 20</b>. Dot plot reporting, in the cell types assigned using the Blueprint ENCODE reference atlas, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.

Figure 20. Dot plot reporting, in the cell types assigned using the Blueprint ENCODE reference atlas, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.


<b>Figure 21</b>. Pie-chart plot displaying, for each cluster at resolution 0.8, the proportion of cells annotated with a specific label using the Blueprint ENCODE reference atlas.

Figure 21. Pie-chart plot displaying, for each cluster at resolution 0.8, the proportion of cells annotated with a specific label using the Blueprint ENCODE reference atlas.


Finally, the Seurat object that includes all the calculated cell features in the metadata slot can be saved in a dedicated output directory for downstream analyses.

output.data.dir <- file.path(main.dir,"RDS objects")
if (!file.exists(output.data.dir)){dir.create(output.data.dir)}
saveRDS(sample.umi,file = file.path(output.data.dir,paste0(sample.name,".Rds")))
Session Info
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] popsicleR_0.1.8             scMCA_0.2.0                
##  [3] scDblFinder_1.8.0           celldex_1.4.0              
##  [5] umap_0.2.7.0                neldermead_1.0-12          
##  [7] optimsimplex_1.0-8          optimbase_1.0-10           
##  [9] Matrix_1.4-0                RANN_2.6.1                 
## [11] ggplotify_0.1.0             RColorBrewer_1.1-2         
## [13] ggExtra_0.9                 crayon_1.5.0               
## [15] patchwork_1.1.1             limma_3.50.1               
## [17] magrittr_2.0.2              gridExtra_2.3              
## [19] session_1.0.3               future_1.24.0              
## [21] SingleR_1.8.1               SummarizedExperiment_1.24.0
## [23] Biobase_2.54.0              GenomicRanges_1.46.1       
## [25] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [27] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [29] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [31] gtools_3.9.2                ape_5.6-2                  
## [33] clustree_0.4.4              ggraph_2.0.5               
## [35] corrplot_0.92               ggplot2_3.3.5              
## [37] dplyr_1.0.8                 R.utils_2.11.0             
## [39] R.oo_1.24.0                 R.methodsS3_1.8.1          
## [41] reticulate_1.24             SeuratObject_4.0.4         
## [43] Seurat_4.1.0               
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.3                scattermore_0.8              
##   [3] tidyr_1.2.0                   bit64_4.0.5                  
##   [5] knitr_1.37                    irlba_2.3.5                  
##   [7] DelayedArray_0.20.0           data.table_1.14.2            
##   [9] rpart_4.1.16                  KEGGREST_1.34.0              
##  [11] RCurl_1.98-1.6                generics_0.1.2               
##  [13] ScaledMatrix_1.2.0            cowplot_1.1.1                
##  [15] RSQLite_2.2.10                bit_4.0.4                    
##  [17] spatstat.data_2.1-2           httpuv_1.6.5                 
##  [19] assertthat_0.2.1              viridis_0.6.2                
##  [21] xfun_0.29                     jquerylib_0.1.4              
##  [23] evaluate_0.15                 promises_1.2.0.1             
##  [25] fansi_1.0.2                   dbplyr_2.1.1                 
##  [27] igraph_1.2.11                 DBI_1.1.2                    
##  [29] htmlwidgets_1.5.4             spatstat.geom_2.3-2          
##  [31] purrr_0.3.4                   ellipsis_0.3.2               
##  [33] RSpectra_0.16-0               backports_1.4.1              
##  [35] deldir_1.0-6                  sparseMatrixStats_1.6.0      
##  [37] vctrs_0.3.8                   SingleCellExperiment_1.16.0  
##  [39] ROCR_1.0-11                   abind_1.4-5                  
##  [41] cachem_1.0.6                  withr_2.4.3                  
##  [43] ggforce_0.3.3                 checkmate_2.0.0              
##  [45] sctransform_0.3.3             scran_1.22.1                 
##  [47] goftest_1.2-3                 cluster_2.1.2                
##  [49] ExperimentHub_2.2.1           lazyeval_0.2.2               
##  [51] labeling_0.4.2                edgeR_3.36.0                 
##  [53] pkgconfig_2.0.3               tweenr_1.0.2                 
##  [55] nlme_3.1-155                  vipor_0.4.5                  
##  [57] rlang_1.0.1                   globals_0.14.0               
##  [59] lifecycle_1.0.1               miniUI_0.1.1.1               
##  [61] filelock_1.0.2                BiocFileCache_2.2.1          
##  [63] rsvd_1.0.5                    AnnotationHub_3.2.1          
##  [65] polyclip_1.10-0               lmtest_0.9-39                
##  [67] zoo_1.8-9                     beeswarm_0.4.0               
##  [69] pheatmap_1.0.12               ggridges_0.5.3               
##  [71] png_0.1-7                     viridisLite_0.4.0            
##  [73] bitops_1.0-7                  KernSmooth_2.23-20           
##  [75] Biostrings_2.62.0             blob_1.2.2                   
##  [77] DelayedMatrixStats_1.16.0     stringr_1.4.0                
##  [79] parallelly_1.30.0             spatstat.random_2.1-0        
##  [81] gridGraphics_0.5-1            beachmat_2.10.0              
##  [83] scales_1.1.1                  memoise_2.0.1                
##  [85] plyr_1.8.6                    ica_1.0-2                    
##  [87] zlibbioc_1.40.0               compiler_4.1.2               
##  [89] dqrng_0.3.0                   fitdistrplus_1.1-6           
##  [91] cli_3.2.0                     XVector_0.34.0               
##  [93] listenv_0.8.0                 pbapply_1.5-0                
##  [95] MASS_7.3-55                   mgcv_1.8-38                  
##  [97] tidyselect_1.1.2              stringi_1.7.6                
##  [99] highr_0.9                     yaml_2.3.5                   
## [101] BiocSingular_1.10.0           askpass_1.1                  
## [103] locfit_1.5-9.4                ggrepel_0.9.1                
## [105] sass_0.4.0                    tools_4.1.2                  
## [107] future.apply_1.8.1            parallel_4.1.2               
## [109] rstudioapi_0.13               bluster_1.4.0                
## [111] metapod_1.2.0                 farver_2.1.0                 
## [113] Rtsne_0.15                    digest_0.6.29                
## [115] BiocManager_1.30.16           shiny_1.7.1                  
## [117] Rcpp_1.0.8                    scuttle_1.4.0                
## [119] BiocVersion_3.14.0            later_1.3.0                  
## [121] RcppAnnoy_0.0.19              httr_1.4.2                   
## [123] AnnotationDbi_1.56.2          colorspace_2.0-3             
## [125] tensor_1.5                    splines_4.1.2                
## [127] statmod_1.4.36                uwot_0.1.11                  
## [129] yulab.utils_0.0.4             spatstat.utils_2.3-0         
## [131] scater_1.22.0                 graphlayouts_0.8.0           
## [133] xgboost_1.5.2.1               shinythemes_1.2.0            
## [135] plotly_4.10.0                 xtable_1.8-4                 
## [137] jsonlite_1.8.0                tidygraph_1.2.0              
## [139] R6_2.5.1                      pillar_1.7.0                 
## [141] htmltools_0.5.2               mime_0.12                    
## [143] DT_0.20                       glue_1.6.2                   
## [145] fastmap_1.1.0                 BiocParallel_1.28.3          
## [147] BiocNeighbors_1.12.0          interactiveDisplayBase_1.32.0
## [149] codetools_0.2-18              utf8_1.2.2                   
## [151] lattice_0.20-45               bslib_0.3.1                  
## [153] spatstat.sparse_2.1-0         tibble_3.1.6                 
## [155] curl_4.3.2                    ggbeeswarm_0.6.0             
## [157] leiden_0.3.9                  openssl_1.4.6                
## [159] survival_3.2-13               rmarkdown_2.11               
## [161] munsell_0.5.0                 GenomeInfoDbData_1.2.7       
## [163] reshape2_1.4.4                gtable_0.3.0                 
## [165] spatstat.core_2.4-0